Rom Hacking 201: A non-ASM approach to compressed data
Notes on きかんしゃトーマス ソドーとうのなかまたち for GameBoy Color
Last Updated: 2022-05-08, abridgewater

* Preface

Prompted by TF1945 asking for and receiving advice on how to translate this game, Bunkai began producing how-to documents for parts of the romhacking process based on the Discord chat logs. It turns out that some of the game's graphics, particularly the title screen graphics, are compressed, so I thought I'd try to document a "differential analysis" attack on the compression scheme, where I start with the ROM and the decompressed data from VRAM, find the compressed data in the ROM, and determine what I can of its encoding (those familiar with cryptography would probably consider this to be a "known plaintext attack").

What good is working out the compression scheme this way? You could use the information to encode new data by hand with a hex editor, but that would be incredibly tedious. A programmer could write standalone decompression and compression programs without needing to reverse engineer the decompression routine used by the game itself (meaning that your hypothetical programmer does not need to be able to do assembly hacking). And knowing the basics of the compression scheme would make it easier for an assembly hacker to figure out how the decompression routine works.

* Acknowledgements

TF1945 for prompting the entire exercise.
Bunkai for deciding to use TF1945's questions and the community response as a teaching moment, and for feedback.
256, 343, and YasaSheep for feedback and for finding out more about the compression scheme.
Bootleg Porygon for feedback.
Lusoneko, Pluvius, and Calico for contributing to the series and project more generally.

And apologies to anyone that I forgot. Mea culpa.

* Investigating the Title Screen

Step one, find our decompressed data. Start the ROM in an emulator that has debugging support, get to the title screen, and pause the emulation. Next, use the tile viewer to find the address of some of the decompressed tiles in VRAM, then point a memory viewer at that address to see the decompressed data bytes.

#+begin_src fundamental
9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
9010: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF
9020: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 01 FE
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
9040: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
9050: 00 FF 00 FF 00 FF 00 FF 00 FF 0E F1 3F C0 7F 80
9060: 07 F8 0F F0 1F E0 1F E0 0F F0 03 FC 00 FF 01 FE
9070: FF 00 FF 00 FF 00 FE 01 F8 07 80 7F 00 FF E0 1F
9080: 01 FE 03 FC 07 F8 07 F8 07 F8 03 FC 03 FC 01 FE
9090: C3 3C EF 10 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00
90A0: C0 3F F8 07 FC 03 FF 00 FF 00 FF 00 FF 00 FF 00
#+end_src

That should be enough data to start with. Now, this is a very regular pattern in the beginning, so we'll skip ahead to a significant break in the pattern: at $903c we have the sequence "38 c7 fe 01", the first few bytes that aren't all-bits-set or all-bits-clear or very close to it.

Next, we look through the ROM to try and find this sequence. I start by looking for "38 c7", and if that fails then I'll look for "c7 fe". These are deliberately small in order to increase the chance of finding a match, at the cost of having a number of "false positive" matches to weed out. My tools find twenty-two occurrences of "38 c7" in the ROM, though (due to tool limitations) they won't find any occurrences that straddle an 8-byte boundary.
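As an aside, a from-scratch search is only a few lines of script and has no such alignment limitation. This is a throwaway sketch of the idea, not part of any particular tool, and the ROM file name is only a placeholder:

#+begin_src python
import sys

# Throwaway sketch: print every offset of a short byte pattern in the ROM,
# with a little context around each hit.  Pass the ROM image's file name
# on the command line (the default name here is just a placeholder).
rom = open(sys.argv[1] if len(sys.argv) > 1 else "rom.gbc", "rb").read()
pattern = bytes.fromhex("38 c7")

offset = rom.find(pattern)
while offset != -1:
    context = rom[max(offset - 8, 0):offset + 8]
    print(f"{offset:06x}: {context.hex(' ')}")
    offset = rom.find(pattern, offset + 1)
#+end_src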
Twenty-two hits is not too long of a list, and if I had an overview of what's where in the ROM from other analysis, I could use that to eliminate candidates.

First match, $1102b.

#+begin_src fundamental
00011020 00 01 00 01 00 01 00 f8 07 c0 3f 38 c7 7c 83 7c |..........?8.|.||
#+end_src

Not it. Next match, $1111b.

#+begin_src fundamental
00011110 00 01 00 01 00 01 00 f8 07 c0 3f 38 c7 7c 83 7c |..........?8.|.||
#+end_src

Not it. Next match, $14128.

#+begin_src fundamental
00014120 00 01 00 01 01 00 00 ff 38 c7 27 c0 20 c0 20 c0 |........8.'. . .|
#+end_src

Not it. Next match, $21169.

#+begin_src fundamental
00021160 83 22 7e 8a 22 7e 88 22 21 38 c7 06 00 2a 5f 2a |."~."~."!8...*_*|
#+end_src

Not it. Next match, $4c012.

#+begin_src fundamental
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
#+end_src

Jackpot! And we note that the pattern from VRAM $905a, "0e f1 3f c0", appears here at $4c01b as "0e ff f1 3f c0", further confirming that this is the data that we're looking for. That "ff" is a bit suspicious, but we'll get to that.

We'll just grab a bit of this data for deeper analysis.

#+begin_src fundamental
0004bff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 |................|
0004c000 00 08 34 05 65 00 00 0d ff 01 0f 13 07 01 fe 1f |..4.e...........|
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
0004c020 7f 80 07 f8 0f ff f0 1f e0 1f e0 0f f0 03 f9 fc |................|
0004c030 3f 01 5e 03 fe 01 f8 07 80 ff 7f 00 ff e0 1f 01 |?.^.............|
0004c040 fe 03 f7 fc 07 f8 01 01 03 fc 03 fc bf 01 fe c3 |................|
0004c050 3c ef 10 82 09 c0 9f 3f f8 07 fc 03 94 07 9f 05 |<......?........|
0004c060 80 f1 7f 01 01 91 01 3f 03 0f f0 0c f3 ff 0c f3 |.......?........|
0004c070 0e f1 fe 01 fc 03 ff e0 1f c0 3f 00 ff 0c f3 3f |..........?....?|
#+end_src

At this point we have a chunk of decompressed data and a couple of places where the decompressed data matches up with data in the ROM, meaning that we've found approximately where the compressed data sits. We don't yet know any details of the compression scheme other than that it includes literal data at times, and we don't know where the compressed data starts or where it ends.

To find either end of the compressed data, we want to look forward or backwards through the ROM until we find the compressed data that corresponds with the start or the end of the decompressed data. This, in turn, requires that we work out at least the basics of the overall compression scheme. And since the compressed data includes literal data, what we do is line it up with the decompressed data and look for patterns: what's in the compressed data that's not in the decompressed data, and what's in the decompressed data that's not in the compressed data?

Our starting point is the first of the matched runs, at $4c012, and we work our way forward from there.

VRAM $903c: 38 c7 fe 01 ff
ROM $4c012: 38 c7 fe 01 ff

VRAM $9041: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
VRAM $9050: 00 ff 00 ff 00 ff 00 ff 00 ff
ROM $4c017: 00 0c 3f 07

VRAM $905a: 0e f1 3f c0 7f 80 07 f8 0f f0 1f e0 1f e0 0f f0 03
ROM $4c01b: 0e ff f1 3f c0 7f 80 07 f8 0f ff f0 1f e0 1f e0 0f f0 03

There's a pattern here. The compressed data has an "ff" byte followed by eight bytes of literal data, more than once.
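Lining the dumps up by eye works, but it gets tedious quickly. A small throwaway helper along these lines (a sketch; it only has to cope with the two dump formats shown above) makes it possible to pull slices of VRAM and ROM into a script and compare them programmatically:

#+begin_src python
def parse_hexdump(text):
    """Turn a pasted hex dump back into bytes, accepting either the
    "9000: 00 FF ..." memory viewer style or the "0004c010 09 9f ... |..|"
    style used above."""
    data = bytearray()
    for line in text.strip().splitlines():
        line = line.split("|")[0]        # drop the ASCII column, if present
        for token in line.split()[1:]:   # the first token is the address
            data.append(int(token, 16))
    return bytes(data)

# Example: the two VRAM rows at $9030-$904f from the dump above.
vram_9030 = parse_hexdump("""
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
9040: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
""")
#+end_src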
Continuing the comparison:

VRAM $906b: fc 00 ff 01 fe ff 00 ff 00 ff 00 fe 01 f8 07 80
ROM $4c02e: f9 fc 3f 01 5e 03 fe 01 f8 07 80

VRAM $907b: 7f 00 ff e0 1f 01 fe 03
ROM $4c039: ff 7f 00 ff e0 1f 01 fe 03

VRAM $9083: fc 07 f8 07 f8 07 f8 03 fc 03 fc 01 fe c3
ROM $4c042: f7 fc 07 f8 01 01 03 fc 03 fc bf 01 fe c3

A break from pattern at $4c02e, a return to pattern at $4c039, and a break again at $4c042... but there are clearly still literal matches here mixed in with runs of non-literal data. What's the larger pattern?

It appears that we have a control byte of some sort that affects how the following data is interpreted. For every bit set in the control byte, there is a byte of literal data following. The clear bits in the control byte signify some other operation, one that produces more decompressed data than there is corresponding compressed data. Taking this insight a step further, it seems that control bits are consumed LSB-first (note the $f9 at ROM $4c02e, followed by a single literal byte, two of something else, and then five more literal bytes), and that each clear control bit corresponds to two bytes in the compressed data.

So, what do these two-byte sequences mean? Well, whenever we see one we can see that it corresponds to a repetition of something from earlier in the decompressed data, making it a backreference, and the entire compression scheme one of the many Lempel-Ziv variants. Overall, this means that the bytes have to encode a /length/ (how many bytes to copy) and a /distance/ (how far back in the decompressed data to start copying). The number of bits used for length and for distance varies from compression scheme to compression scheme, and determines how long a length can be encoded and how far back in the decompressed data a reference can reach. In this case, however, it looks like a nice, easy-to-handle eight bits each. [Spoiler: It isn't. See below... or the next document.]

Just from looking at the examples we have above, it appears that the first byte is the position, and is /one less/ than the (backwards!) offset from which to start copying; and the second byte is the length, and is /three less/ than the number of bytes to copy.

Why is the position value one less than the offset? Because a zero offset would be the /next/ byte to output, which doesn't make any kind of sense, and biasing the position gives that little extra bit of range for a backreference.

Why is the length value three less than the number of bytes to copy? A backreference costs seventeen bits to encode (one control bit to signify the backreference, two bytes to encode the position and length), while literal bytes cost nine bits each (one control bit to signify the literal, one byte for the literal). A length of zero or one costs more to encode than using literals (zero or nine bits vs. seventeen bits). A length of two is a very slight savings compared to using literals (eighteen bits vs. seventeen bits). And the same argument for that little extra bit of range applies as for the position.

So that's the basics of the scheme. There are still two open questions. First, how does the decompressor know when to stop decompressing? Second, where does the compressed data start?

There are two ways for the decompressor to know when to stop. Either it knows how long the decompressed data will be and stops when it fills that space, or there is a special end marker that it looks for in the compressed data.
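Before tackling those, it helps to pin the hypothesis down as code. Below is a minimal decoder sketch of the scheme as understood at this point, not the game's actual routine: the stop condition is assumed to be a caller-supplied output length, and (per the spoiler above) the backreference encoding described here turns out to be subtly wrong, with a corrected decode given in the retrospective below.

#+begin_src python
def decompress(rom, offset, out_len):
    """Sketch of the hypothesized scheme: control bits consumed LSB-first,
    set bits are literal bytes, clear bits are two-byte backreferences
    encoded as (distance - 1, length - 3).  The stop condition (out_len)
    is an assumption."""
    out = bytearray()
    while len(out) < out_len:
        control = rom[offset]
        offset += 1
        for bit in range(8):                  # control bits, LSB first
            if len(out) >= out_len:
                break
            if control & (1 << bit):          # set bit: one literal byte
                out.append(rom[offset])
                offset += 1
            else:                             # clear bit: backreference
                distance = rom[offset] + 1    # first byte is distance - 1
                length = rom[offset + 1] + 3  # second byte is length - 3
                offset += 2
                for _ in range(length):       # byte-wise copy so overlaps repeat
                    out.append(out[-distance])
    return bytes(out)
#+end_src

Fed the ROM data starting where the compressed stream turns out to begin (see below), and an assumed output length, a decoder along these lines reproduces the VRAM dump above, at least until a backreference needing more than eight bits of distance comes along.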
The end marker is the more likely of the two, and the only space in the encoding scheme where it can fit is as a backreference with a particular value. This is an easy enough thing to determine: walk forward through the compressed data until you reach the end. If it just stops without any unusual backreference, it's length-limited, and the length needs to be found. If there /is/ an unusual backreference, then that's the end marker. Ultimately, I'm not too fussed about this, so I'll leave it for someone else to find. [Spoiler: It turned out to be a length limit.]

As far as finding the start of the compressed data goes, we had a solid match between "38 c7 fe 01" at ROM $4c012 and the same bytes at VRAM $903c. The immediately preceding ROM byte is "9f", which would be a control byte that calls for five literal bytes, two backreferences, and a further literal, which is precisely the pattern that we found leading up to the blocks of eight literal bytes. This puts VRAM $903c and ROM $4c011 at the point of needing to fetch a control byte:

#+begin_src fundamental
9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
9010: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF
9020: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 01 FE
9030: 00 FF 00 FF 00 FF 00 FF 00 FF 00 FF 38 C7 FE 01
#+end_src

#+begin_src fundamental
0004bff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 |................|
0004c000 00 08 34 05 65 00 00 0d ff 01 0f 13 07 01 fe 1f |..4.e...........|
0004c010 09 9f 38 c7 fe 01 ff 00 0c 3f 07 0e ff f1 3f c0 |..8......?....?.|
#+end_src

So, we work our way backwards. The preceding byte in VRAM is "ff". The preceding byte in ROM is "09". These are not the same, so we're definitely looking at a backreference, "1f 09". This is twelve ($0c) bytes copied, bringing us to VRAM $9030 and ROM $4c00f.

Next, we have two bytes that are the same in both VRAM and ROM, so probably literal bytes, bringing us to VRAM $902e and ROM $4c00d. The preceding byte in VRAM is again "ff", while ROM is "07", signifying a backreference "13 07". This is ten ($0a) bytes copied, bringing us to VRAM $9024 and ROM $4c00b. At this point we've seen a backreference, two literal bytes, and a backreference, meaning that the eventual control byte that we find will have an upper nybble of 0110 binary. There are four more items to find before that control byte.

Continuing on, the preceding VRAM byte is again "ff", while the preceding ROM byte is "0f", so we have a backreference "01 0f". This is a length of eighteen ($12) bytes, putting us at VRAM $9012 and ROM $4c009. And here is where I missed a trick in my initial analysis: the VRAM contents covered by /both/ of these backreferences are one repeated run of "00 ff", with a total length easily represented in eight bits, so why is it /two/ backreferences? Clearly, the backreference length is limited to four bits (three to eighteen bytes of length), and the other four bits in that second byte are used for something else. Within the context of an LZ compression scheme, this means either an escape such as end-of-data or additional position bits. [Spoiler: it turned out to be additional position bits. See the next document.]

VRAM $9011 is "ff", ROM $4c008 is "ff". This strongly suggests a literal byte. VRAM $9010 is "00", ROM $4c007 is "0d". This would be "00 0d", a sixteen ($10) byte backreference, bringing us to VRAM $9001 and ROM $4c006. And we've now seen seven items (a backreference, two literals, two backreferences, a literal, and a backreference), so there's just the one item left before a control byte.
And it's a "00" in both VRAM and ROM, meaning a probable literal. This brings us to the start of the VRAM area that we were looking at, and we should expect to (and do!) see a control byte of "65" at ROM $4c004.

The compressed data stream thus starts at ROM $4c004. Four bytes into a 16k bank is an unusual starting point. There /are/ cases where it is a natural boundary, but they involve specific banks (typically the first or last bank of a ROM) and specific CPU semantics (fixed interrupt or cold-start/reset entry points). While some of these /do/ apply to the GameBoy CPU, they /do not/ apply to bank $13 (which is where the compressed data is stored), so we're probably looking at a header. The meaning of the probable header, "00 08 34 05", isn't immediately obvious, though the least uncomfortable interpretation is as two 16-bit words, $0800 and $0534, one or both of which might be some sort of flags or compression type, or could be a length.

* Retrospective: What Was Found Since

Within a day or so of writing the initial version of this document, 256 informed me of a few additional things, and then YasaSheep took things even further, both using debugger and disassembly techniques (what I consider to be "dynamic" code analysis). So, what did they find, and how could it have been found without having to break out the disassembler?

Let's start with the information from 256:

- The four-byte header before the compressed data stream consists of a 16-bit offset relative to the load address passed to the decompression routine and the length of the compressed data stream.

Taking the easy one first, the length of the compressed data stream is how the decompression routine determines when to stop, and while its being determined by the /compressed/ data length would have been a mild surprise, it would have been discovered by the process outlined for finding the terminator. As said towards the end of the previous section, "it's length-limited, and the length needs to be found".

The offset, in this case $0800, is trickier. With just the one block of compressed data, it would be meaningless. With multiple compressed data blocks there might be a pattern to it that would make it obvious. Or a corruption attack might reveal something. In both cases there's a risk of missing a significant contributing factor (about which more below). There's also a potential confusion for the particular compressed data block under analysis: the /decompressed/ length appears to be $0800 bytes, so mistaking the offset for a length is possible. This would have been revealed via a corruption attack or by analyzing a second block of compressed data (one with a decompressed length that differs from its offset). Alternatively, maybe it /is/ the length of the decompressed data after all?

- There is a table of compressed data blocks to load, and where to load them, per "state" of some sort of state machine, comprising a 16-bit source address, an 8-bit source ROM bank, a 16-bit target address, and an 8-bit target VRAM bank, repeated for however many compressed blocks, and terminated by three $ff bytes.

The specific address given for one of these tables was $00:1360, for the title screen. ROM file offset $4c000 is CPU address $4000 in bank $13, or $13:4000. This is a /lousy/ address to try and find by searching. That said, it is good as additional validation that we have found the correct table.
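To make the table layout concrete, parsing it might look something like the sketch below. The field order is as 256 described it; the little-endian 16-bit words and the bank-to-file-offset arithmetic are my assumptions, not something verified against the game's code:

#+begin_src python
import struct

def read_load_table(rom, table_offset):
    """Walk a load table of the shape 256 described: repeated entries of
    (16-bit source address, 8-bit source ROM bank, 16-bit target address,
    8-bit target VRAM bank), terminated by three $ff bytes."""
    entries = []
    offset = table_offset
    while rom[offset:offset + 3] != b"\xff\xff\xff":
        src_addr, src_bank, dst_addr, dst_bank = struct.unpack_from("<HBHB", rom, offset)
        entries.append({
            "src_bank": src_bank,
            "src_addr": src_addr,
            # assumes the source sits in the switchable $4000-$7fff window
            "src_file_offset": src_bank * 0x4000 + (src_addr - 0x4000),
            "dst_bank": dst_bank,
            "dst_addr": dst_addr,
        })
        offset += 6
    return entries

# The title screen table was reported at CPU $00:1360, which in bank 0 is
# simply ROM file offset $1360:
#   read_load_table(rom, 0x1360)
#+end_src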
A quick check in the tile viewer shows three or four distinct regions loaded as part of the title screen, two of which are tiles and two of which are not (almost certainly the tile /map/). Finding the compressed data for even one of these, like we did for the initial tile area, would raise our odds of finding the table considerably. Searching for the CPU address of one of the compressed blocks and finding its bank ID, the CPU addresses of the other blocks, and the bank IDs of those other blocks in proximity would nail down the location of the table. The terminator of three $ff bytes is at that point obvious. From there we see that there are three bytes unaccounted for associated with each compressed block, and their values would strongly suggest VRAM addresses. Corrupting these would serve as a final confirmation.

Matching the VRAM addresses in the table to the actual loaded addresses for the data shows general agreement. This includes the first compressed data block, the one that we examined above, which suggests that the first word of the compressed data header is /not/ some kind of address offset after all.

A second angle on all this is predicated on having already found the end of the first compressed data block: data that is used together or is of the same kind is often stored together, which means that data stored together /can/ be (but is not necessarily) related by more than mere proximity. And it turns out that the next three chunks in the ROM file are the other three compressed data blocks for the title screen, so we might have found the addresses for the table that way.

256 also provided the address of the second compressed data block, the address of the decompression routine, and the register and machine state requirements for calling the decompression routine. Clearly, we could have found the compressed data block, but the information about the decompression routine cannot be found without some form of code analysis.

YasaSheep went and found most of the same things that 256 did, but also one more thing (actually the focus of their investigation):

- Backreferences are a 12-bit offset and a 4-bit length, with the top four bits of the offset stored in the upper four bits of what superficially appears to be the length byte.

This is a distinct difference from what I claimed previously. YasaSheep actually found it because an initial attempt at writing a decompressor didn't work out, presumably producing obviously incorrect output, which led to a code-level analysis of the in-game decompressor. Could this have been figured out from comparing the compressed and decompressed data? Yes. The time-consuming part is actually finding cases where it occurs. Once a place is found where a decompressor fails to match what the "reference implementation" (the game's decompressor) does, along with what the correct output and the compressed data are, differential analysis applies. The critical problem is realizing that there /is/ a disconnect between the model of how the compression scheme works and how the compression scheme actually works. Code analysis doesn't leave these ambiguities (what the code does is what the code does), but it does involve its own tools and techniques.
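Retrofitting YasaSheep's finding onto the decoder sketched earlier only changes the two-byte backreference decode. A hedged version (the one-less and three-less biases are carried over from the earlier analysis, not re-verified here) would be:

#+begin_src python
def decode_backreference(byte1, byte2):
    """Backreference fields per YasaSheep's finding: a 12-bit offset whose
    top four bits sit in the upper nybble of what superficially looks like
    the length byte, and a 4-bit length in the lower nybble.  The biases
    are assumed to match the earlier analysis."""
    distance = (((byte2 & 0xF0) << 4) | byte1) + 1  # 12-bit offset, biased by one
    length = (byte2 & 0x0F) + 3                     # 4-bit length: 3 to 18 bytes
    return distance, length
#+end_src

Notably, every backreference walked through above happens to have a zero upper nybble in its second byte, which is why the simpler eight-bits-each reading appeared to hold up.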
What are other ways to have found this error? Writing a compressor that encodes a run longer than eighteen bytes would trigger a similar failure, in that the decompressor would produce unexpected output. Finding a place in existing data that encodes such a run could also indicate that at least /something/ wasn't quite right, since it would be two or more backreferences for no immediately apparent reason, but it requires paying careful attention rather than moving quickly, which is why I missed the critical clue during my initial analysis.

Overall, this approach works well for Run-Length Encoding and Lempel-Ziv style byte-oriented compression. It might also work for dictionary compression such as is used for script text in RPGs and the like, though telling the dictionary apart from the compressed text might be an issue there. I don't know that it would work all that well with a lossy compression, or with a bit-stream compression such as certain Huffman compression schemes.

* EOF